Introduction

For this project I chose to work with he 2016 presidential campaign finance data, I am very much interested in the link between candidates and campaign finance and would like to see big money out of politics for various reasons. This gives me a personal interest in the data, which will hopefully help me to find real insight in what I’m looking at.

Front matter

I obtained the data from the FEC.gov website here: http://fec.gov/disclosurep/pnational.do on 6/20/2016. I downloaded the entire data set, in order to give me the most data to work with. My intent is to subset this data into smaller more workable sets. I downloaded the file and put the csv in the same folder as my .rmd file for ease of access. I named the file 2016Presidential_Election_Contributions.csv for clarity.

Let’s load some libraries into R that I will make use of through this analysis.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

Getting familiar with the data

This is a big data set and the first step is to figure out what all is contained in it.

#ec <- read.csv('2016Presidential_Election_Contributions.csv') # gave errors
ec <- read.csv('2016Presidential_Election_Contributions.csv', row.names = NULL)
colnames(ec) <- c(colnames(ec)[-1],"x")
ec$x <- NULL

Right away I ran into an issue, I got an error message telling me duplicate row.names are not allowed. A quick Google search brought me to this page: http://stackoverflow.com/questions/13239639/duplicate-row-names-error-reading-table-row-names-null-shifts-columns which lead me to formatting the read.csv command as above in the uncommented section as well as adding the two subsequent lines to shift the column names and remove the last column.

Data Set Statistics

After loading the data set we can see there are 3,269,914 observations of 18 variables. 15 of the variables were identified at import as factors 1 contained chr data, 1 contained num data, 1 contained int data.
Here are the variables:

names(ec)
##  [1] "cmte_id"           "cand_id"           "cand_nm"          
##  [4] "contbr_nm"         "contbr_city"       "contbr_st"        
##  [7] "contbr_zip"        "contbr_employer"   "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt"  "receipt_desc"     
## [13] "memo_cd"           "memo_text"         "form_tp"          
## [16] "file_num"          "tran_id"           "election_tp"

Of particular interest are:

  • cand_*- which seem to correspond to the candidate receiving the donations
  • contbr_*- which seem to correspond to the particular contributor
  • contb_* which seems to correspond the the actual contribution
  • election_tp- which looks at first glace to be an indication for which election the contribution is for

Here are the list of all 24 candidates:

unique(ec$cand_nm)
##  [1] Rubio, Marco              Santorum, Richard J.     
##  [3] Perry, James R. (Rick)    Carson, Benjamin S.      
##  [5] Cruz, Rafael Edward 'Ted' Paul, Rand               
##  [7] Clinton, Hillary Rodham   Sanders, Bernard         
##  [9] Fiorina, Carly            Huckabee, Mike           
## [11] Pataki, George E.         O'Malley, Martin Joseph  
## [13] Graham, Lindsey O.        Bush, Jeb                
## [15] Trump, Donald J.          Jindal, Bobby            
## [17] Christie, Christopher J.  Walker, Scott            
## [19] Stein, Jill               Webb, James Henry Jr.    
## [21] Kasich, John R.           Gilmore, James S IIII    
## [23] Lessig, Lawrence          Johnson, Gary            
## 24 Levels: Bush, Jeb Carson, Benjamin S. ... Webb, James Henry Jr.

No surprises here, and some basic statistics about the contribution amounts:

summary(ec$contb_receipt_amt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -84240      15      35     142     100 4905000

Some surprising things here:

  • min contributions are negative, indicating this includes refunds to donors
  • the max is $4.9 Million, which right away to me indicated this data set was not only for personal contributions to candidates since the personal limit for donations is $2,700
  • a large difference between the mean and median

Let’s add some stuff

I know I will want to add Party (Dem/Rep/Green) as well as Gender (M/F/O) to my data, so lets do that right away.

female_candidates <- c('Clinton, Hillary Rodham', 'Fiorina, Carly', 'Stein, Jill')
democratic_candidates <- c('Clinton, Hillary Rodham','Sanders, Bernard', "O'Malley, Martin Joseph","Webb, James Henry Jr.", "Lessig, Lawrence")
green_candidates <- c('Stein, Jill')
ec$gender <- ifelse(ec$cand_nm %in% female_candidates, 'Female', 'Male')
ec$party <- ifelse(ec$cand_nm %in% democratic_candidates, 'Democrat', ifelse(ec$cand_nm %in% green_candidates, 'Green', 'Republican'))

That is better, honestly I’m somewhat surprised this wasn’t included in the data set.

First look at the data

I am chomping at the bit to see what this data looks like, and one of the main factors is the contribution amounts.

ggplot(aes(x=contb_receipt_amt), data = ec) + geom_histogram(binwidth=100) 

This is pretty unreadable, the chart shows, what the summary statistics indicated, that the data is a bit spread out. It is difficult to see any information on the chart.

Let’s try to filter things down a bit.

ggplot(aes(x=contb_receipt_amt), data = subset(ec, contb_receipt_amt > 0)) + 
  geom_histogram(binwidth=50) + xlim(0,quantile(ec$contb_receipt_amt, .9))
## Warning: Removed 223192 rows containing non-finite values (stat_bin).

Now this is better. Here we are looking just at contributions > 0 and under the 90% statistic for donations. You can also see, clearly that the most contributions come in the first few hundred dollars.

I wonder if the same is true of returned contributions?

ggplot(aes(x=contb_receipt_amt), data = subset(ec, contb_receipt_amt < 0)) + 
  geom_histogram(binwidth=50) + xlim(-5000,0)
## Warning: Removed 295 rows containing non-finite values (stat_bin).

Interesting that the negative contributions have a different shape to the histogram then the positive contributions. I wonder why there is such a huge spike of negative contributions around -$2,700. My guess would be that people might have mistakenly contributed or been charged twice for a maximum contribution, and subsequent contributions were refunded. This seems reasonable considering $2,700 is the max personal contribution amount by individuals.

What about those high dollar and negative contributions? What are those about?

I had to find out so lets take a look at the top 10 negative dollar contributions

head(ec[order(ec$contb_receipt_amt),
        c("cand_nm","contbr_nm", "contbr_employer", "contb_receipt_amt", "contb_receipt_dt", "memo_text", "receipt_desc")],
     n=10)
##                           cand_nm         contbr_nm contbr_employer
## 3222208          Trump, Donald J.       BOCH, ERNIE                
## 1111065   Clinton, Hillary Rodham GOCKE, THOMAS DR.                
## 276496        Carson, Benjamin S.         REID, SUE                
## 820693  Cruz, Rafael Edward 'Ted' CLARK, ELLOINE M.                
## 148302        Carson, Benjamin S.  HILLMAN, TATNALL                
## 820694  Cruz, Rafael Edward 'Ted'       DURHAM, JOE                
## 276649        Carson, Benjamin S.     SUMIDA, ALICE                
## 278284        Carson, Benjamin S.     SUMIDA, ALICE                
## 1975289          Sanders, Bernard  BARCOCK, BELINDA                
## 148303        Carson, Benjamin S.  HILLMAN, TATNALL                
##         contb_receipt_amt contb_receipt_dt memo_text receipt_desc
## 3222208          -84236.8        30-SEP-15                 Refund
## 1111065          -20000.0        30-JUN-15                 Refund
## 276496           -18100.0        13-APR-16                 Refund
## 820693           -16600.0        12-APR-16                 Refund
## 148302           -16300.0        21-OCT-15                 Refund
## 820694           -13500.0        15-APR-16                 Refund
## 276649           -13000.0        01-MAR-16                 Refund
## 278284           -12242.0        31-DEC-15                 Refund
## 1975289          -12000.0        10-OCT-15                 Refund
## 148303           -11700.0        15-DEC-15                 Refund

This list shows the largest 10 negative contribution amounts. I limited the columns so that the cruft was removed in hopes to have some light shed on the issue of where these contributions came from and gain some insight into how they might come about.

The top negative contribution was for an Ernie Boch who was refunded $84,236.80 from Trump. A quick Google search shows this result relating to the incident: http://www.boston.com/news/local/massachusetts/2015/10/15/ernie-boch-tried-donate-way-way-too-much-money-donald-trump-gets-refunded/1QhFWWAIzyyK4Jc79xB5aO/story.html which indicates that Boch made contributions in kind to the Trump campaign in the form of paying for catering at an event for Trump, the amount exceeded the contribution limit as set forth by the FEC.

What other contributions has Boch made?

subset(ec[, c("cand_nm","contbr_nm", "contbr_employer", "contb_receipt_amt", "contb_receipt_dt", "memo_text", "receipt_desc")], ec$contbr_nm == 'BOCH, ERNIE')
##                  cand_nm   contbr_nm       contbr_employer
## 3222208 Trump, Donald J. BOCH, ERNIE                      
## 3222223 Trump, Donald J. BOCH, ERNIE BOCH AUTOMOTIVE GROUP
##         contb_receipt_amt contb_receipt_dt memo_text receipt_desc
## 3222208          -84236.8        30-SEP-15                 Refund
## 3222223           86936.8        28-AUG-15

One Donation for $86,936.80 and one refund for $84,236.80 which is a difference of

86936.80 - 84236.80
## [1] 2700

Which is the personal contribution limit for a person to give to a candidate. Checks out.

That explains the biggest, what about the next? Using Similar research methods I discovered that Dr. Thomas Gocke donated $20,000 to Clinton on 6/29/2015 and as per the filing here: http://docquery.fec.gov/cgi-bin/fecimg/?201509039001609936 was refunded that money on 6/30/2015.

subset(ec[, c("cand_nm","contbr_nm", "contbr_employer", "contb_receipt_amt", "contb_receipt_dt", "memo_text", "receipt_desc")], ec$contbr_nm == 'GOCKE, THOMAS DR.')
##                         cand_nm         contbr_nm contbr_employer
## 1111065 Clinton, Hillary Rodham GOCKE, THOMAS DR.                
##         contb_receipt_amt contb_receipt_dt memo_text receipt_desc
## 1111065            -20000        30-JUN-15                 Refund

uh only one listed, strange. Maybe his name is listed differently someplace else?

subset(ec[, c("cand_nm","contbr_nm", "contbr_employer", "contb_receipt_amt", "contb_receipt_dt", "memo_text", "receipt_desc")], grepl(glob2rx("GOCKE, THOMAS*") , ec$contbr_nm) )
##                         cand_nm         contbr_nm contbr_employer
## 1110442 Clinton, Hillary Rodham     GOCKE, THOMAS   SELF-EMPLOYED
## 1111065 Clinton, Hillary Rodham GOCKE, THOMAS DR.                
##         contb_receipt_amt contb_receipt_dt           memo_text
## 1110442             20000        29-JUN-15 REFUNDED ON 6/30/15
## 1111065            -20000        30-JUN-15                    
##         receipt_desc
## 1110442             
## 1111065       Refund

Thanks to this resource for explaining the usage of fuzzy matching with grepl and glob2rx: http://stackoverflow.com/questions/5823503/pattern-matching-using-a-wildcard

As I suspected he was a Dr when he was refunded but not a Dr when he donated.

So it looks like those are explained pretty well, I will assume that those negative contributions correspond to over payment type contributions. Or mistaken contributions to certain candidates.

What about the BIG donations?

head(ec[order(-ec$contb_receipt_amt),
        c("cand_nm","contbr_nm", "contbr_employer", "contb_receipt_amt", "contb_receipt_dt", "memo_text", "receipt_desc")],
     n=10)
##                         cand_nm                         contbr_nm
## 1359611 Clinton, Hillary Rodham HILLARY VICTORY FUND - UNITEMIZED
## 1327390 Clinton, Hillary Rodham HILLARY VICTORY FUND - UNITEMIZED
## 1368404 Clinton, Hillary Rodham HILLARY VICTORY FUND - UNITEMIZED
## 1341844 Clinton, Hillary Rodham HILLARY VICTORY FUND - UNITEMIZED
## 1340986 Clinton, Hillary Rodham HILLARY VICTORY FUND - UNITEMIZED
## 1347693 Clinton, Hillary Rodham HILLARY VICTORY FUND - UNITEMIZED
## 3222223        Trump, Donald J.                       BOCH, ERNIE
## 3261145         Kasich, John R.        NAP INVESTORS HOLDINGS LLC
## 3261143         Kasich, John R.               B&J DEVELOPMENT LLC
## 1110442 Clinton, Hillary Rodham                     GOCKE, THOMAS
##               contbr_employer contb_receipt_amt contb_receipt_dt
## 1359611                               4904860.5        31-MAR-16
## 1327390                               3686373.3        29-FEB-16
## 1368404                               3600489.1        29-APR-16
## 1341844                               1797624.9        31-DEC-15
## 1340986                               1603724.4        04-DEC-15
## 1347693                               1467070.9        31-JAN-16
## 3222223 BOCH AUTOMOTIVE GROUP           86936.8        28-AUG-15
## 3261145                                 29100.0        30-DEC-15
## 3261143                                 25000.0        30-DEC-15
## 1110442         SELF-EMPLOYED           20000.0        29-JUN-15
##                   memo_text receipt_desc
## 1359611                   *             
## 1327390                   *             
## 1368404                   *             
## 1341844                   *             
## 1340986                   *             
## 1347693                   *             
## 3222223                                 
## 3261145                                 
## 3261143                                 
## 1110442 REFUNDED ON 6/30/15

Some of these large contributions come from PAC groups. The largest 6 donations came from Hillary Victory Fund in the total of:

4904860.5 + 3686373.3 + 3600489.1 + 1797624.9 + 1603724.4 + 1467070.9
## [1] 17060143

Wow, lots of money for Clinton from that PAC.

The other large donations look like they are from other PAC’s (LLC, LLP, etc) or are set to be refunded as over payments. Including Dr (sometimes) Thomas Gocke.

From what I can tell there doesn’t seem to be a good way to identify the PAC vs Individual donations to various candidates from the data. Something more advanced might be able to be developed but there is nothing that is given to us that tells us this. That is a shame.

Back to looking at contributions

How about contributions by candidate?

ggplot(aes(x=cand_nm, y=contb_receipt_amt), data=ec) + 
  geom_bar(stat="summary", fun.y = sum) 

This is difficult to read because of the names of the candidates being so long, it is hard to tell which corresponds to which.

ggplot(aes(x=cand_nm, y=contb_receipt_amt), data=ec) + 
  geom_bar(stat="summary", fun.y = sum) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Alright! Easy to see now that Clinton has received the bulk of donations in terms of sheer dollars, followed by Sanders, then Cruz and Rubio in that order. Well that is total dollar contributions, way to go Hillary, what about average contribution amount?

ggplot(aes(x=cand_nm, y=contb_receipt_amt), data=ec) + 
  geom_bar(stat="summary", fun.y = mean) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Oh a much more even playing field. Here Jindal jumps to the top of the charts, pulling in around $1,600 on average per donation, that seems like a lot per donation. He is followed closely by Pataki, Christi, Gillmore and Bush. Interesting to me are the people at the bottom of this list. If a candidate has small average contributions, but has raised a lot of money we can assume that lots of people are donating, which for any politician is a good thing. Candidates in this camp are notably Sanders, Cruz and Carson.

Let’s take a look at number of contributions then.

 ggplot(aes(x=cand_nm), data=ec) + 
  stat_count(width = 1)+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

If you have been watching the news it shouldn’t be all that surprising to find Sanders more then double the number of contributions of his nearest rival. The next highest in number of contributors are Clinton, and Cruz. This group dominates the rest of the candidates in sheer number of contributions.

What about Democrats vs. Republicans vs Green? Who has raised more money

ggplot(aes(x=party, y=contb_receipt_amt/1000000), data=ec) + 
  geom_bar(stat="summary", fun.y = sum) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

This is in millions of dollars. So Democrats are around $260M and looks like Republicans Just over $200M. Green party is far behind with no noticeable money on this chart.

Who overall is getting more money per donation?

ggplot(aes(x=party, y=contb_receipt_amt), data=ec) + 
  geom_bar(stat="summary", fun.y = mean) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Looks like the Democrats are low here, possibly pulled down by Sanders large number of small dollar contributions.

What about donations over time?

ggplot(aes(x=contb_receipt_dt, y=contb_receipt_amt), data=ec) +
  geom_line(aes(color=cand_nm), stat="summary", fun.y= sum)
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?

Oops that didn’t give at all what I expected. I wonder if the contb_receipt_dt field needs to be converted into a date?

ec$contb_receipt_dt <- as.Date(ec$contb_receipt_dt, "%d-%b-%y")

There we go all date like. Let’s try the plot again.

ggplot(aes(x=contb_receipt_dt, y=contb_receipt_amt), data=ec) +
  geom_area(aes(color=cand_nm), stat="summary", fun.y= sum)

Much better, but too much information. Let’s see what our date bounds are.

min(ec$contb_receipt_dt)
## [1] "2013-10-01"
max(ec$contb_receipt_dt)
## [1] "2016-04-30"

Looks like this data set goes through to April 30Th, 2016 and starts with contributions all the way back to 2013.

Let’s just look at since March of 2015.

ggplot(aes(x=contb_receipt_dt, y=contb_receipt_amt), 
       data= subset(ec, contb_receipt_dt >= as.Date("2015-03-01"))) +
  geom_area(aes(color=cand_nm), stat="summary", fun.y= sum)

This is better, but I think the graph just doesn’t quite look right, it doesn’t make sense that the total dollars would fluctuate up and down like that I think this isn’t showing a proper cumulative summary like I was expecting, but a sum of each candidates donations on each date.

I had a really hard time coming up with the right way to do this just using ggplot, in the end I created another data frame and grouped the data by candidate,and date and summed the contributions on that date, then cumulatively summed them, then used that data frame in the ggplot viola! Credit to this site for helping me break through that challenge. http://www.markhneedham.com/blog/2014/08/31/r-ggplot-cumulative-frequency-graphs/ I tried lots of stuff, but this was the one that helped me get on the right path.

ec.day_group <- ec %>% group_by(party, cand_nm, contb_receipt_dt) %>% summarize(daily_contributions = sum(contb_receipt_amt), n = n()) %>% mutate(daily_contributions = cumsum(daily_contributions), n = cumsum(n)) 
ggplot(aes(x=contb_receipt_dt, y=daily_contributions/100000), 
       data= subset(ec.day_group, contb_receipt_dt >= as.Date("2015-03-01"))) +
  geom_line(aes(color=cand_nm))

Look at Hillary shoot up, wow!

What about number of donations?

ggplot(aes(x=contb_receipt_dt, y=n), 
       data= subset(ec.day_group, contb_receipt_dt >= as.Date("2015-03-01"))) +
  geom_line(aes(color=cand_nm))

What about Democrats vs republicans vs green, lets add a facet

ggplot(aes(x=contb_receipt_dt, y=daily_contributions/100000), 
       data= subset(ec.day_group, contb_receipt_dt >= as.Date("2015-03-01"))) +
  geom_line(aes(color=cand_nm)) + 
  facet_wrap(~party)

The Republican contributions are much more competitive within the party, while the Democrats are dominated by Clinton. What about the same breakdown but for number of donations total?

ggplot(aes(x=contb_receipt_dt, y=n), 
       data= subset(ec.day_group, contb_receipt_dt >= as.Date("2015-03-01"))) +
  geom_line(aes(color=cand_nm)) + 
  facet_wrap(~party)

Two big superstars from each group here. Clinton/Sanders, and Cruz/Carson.

Let’s take a look at how the debates impacted number of donations to the candidates. Dates were taken from http://www.uspresidentialelectionnews.com/2016-debate-schedule/2016-republican-primary-debate-schedule/ and http://www.uspresidentialelectionnews.com/2016-debate-schedule/2016-democratic-primary-debate-schedule/

My thinking being, did the debate have a significant impact on the minds of voters and where they are spending donations?

ggplot(aes(x=contb_receipt_dt, y=n), 
       data= subset(ec.day_group, contb_receipt_dt >= as.Date("2015-08-01"))) +
  geom_line(aes(color=cand_nm)) +
  geom_vline(xintercept = as.numeric(as.Date("2015-08-06")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2015-09-16")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2015-10-28")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2015-11-10")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2015-12-15")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-01-14")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-01-28")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-02-06")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-02-13")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-02-25")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-03-03")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-03-10")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-03-21")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2015-10-13")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2015-11-14")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2015-12-19")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-01-17")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-01-25")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-02-04")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-02-11")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-03-06")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-03-09")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-04-14")), colour = "blue")

Since the data set only goes through April 30, 2016 this only includes debates that happened before then.

I like this visualization but I don’t think I see any strong indications that donations increased as a result of a debate. Too bad, I was hoping to see that kind of insight from this chart.

What about the same information but for total dollars?

ggplot(aes(x=contb_receipt_dt, y=daily_contributions/100000), 
       data= subset(ec.day_group, contb_receipt_dt >= as.Date("2015-08-01"))) +
  geom_line(aes(color=cand_nm)) +
  geom_vline(xintercept = as.numeric(as.Date("2015-08-06")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2015-09-16")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2015-10-28")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2015-11-10")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2015-12-15")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-01-14")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-01-28")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-02-06")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-02-13")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-02-25")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-03-03")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-03-10")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2016-03-21")), colour = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2015-10-13")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2015-11-14")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2015-12-19")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-01-17")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-01-25")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-02-04")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-02-11")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-03-06")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-03-09")), colour = "blue")+
  geom_vline(xintercept = as.numeric(as.Date("2016-04-14")), colour = "blue")

Again an interesting visualization but there doesn’t seem to be any obvious link between debates and contribution amount or number.

Let’s focus on the Democrats around the last debate let’s filter the data after 4/1/2016

ggplot(aes(x=contb_receipt_dt, y=n), 
       data= subset(ec.day_group, 
                    contb_receipt_dt >= as.Date("2016-04-01") &
                      party == 'Democrat'
                    )) +
  geom_line(aes(color=cand_nm)) +
  geom_vline(xintercept = as.numeric(as.Date("2016-04-14")), colour = "blue")

I did some research trying to find who (if any) had an overall better performance in this debate. Not an exhaustive search but using these sources:

The 4 sources are split in who they say did best in the debate. All seem to agree that it was a close fight. Again this doesn’t seem to give any insight into the pace of donations around these debates. Overall the pace of donations seems to remain about the same. In this chart we can see Sanders getting more donations per day, which seems to continue through the end of April. There just doesn’t seem to be any jumps or changes related to the debates. This seems to indicate that the people making donations don’t really seem too affected by the debates, or that any changes in contribution amounts or frequency is completely dwarfed by the amount of contributions so far and the changes aren’t readily apparent.

Final Plots and Summary

In selecting three plots I want to make sure that the end results are readable and convey some good information. Along those lines I chose Number of Contributions, Total Donations over time, Contributions Around the latest Democratic Debate to plot. Below are my tweaks and adjustments to make each plot production ready.

Number of Contributions through the campaign

Here is the chart from earlier in the analysis:

 ggplot(aes(x=cand_nm), data=ec) + 
  stat_count(width = 1)+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

First lets see about cleaning up the axis labels.

 ggplot(aes(x=cand_nm), data=ec) + 
  stat_count(width = 1)+
  ggtitle('Number of Donations by Candidate')+
  labs(x='Candidate', y='Number of Donations')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Looking better already. Gave the chart a title and named the axis to make them more readable. I also changed the angle of the candidate names from 90 to 45 in order to save a bit of room on the screen.

How about adding some color?

 ggplot(aes(x=cand_nm, fill=party), data=ec) + 
  stat_count(width = 1)+
  ggtitle('Number of Donations by Candidate')+
  labs(x='Candidate', y='Number of Donations')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

that is pretty nice, but the colors are off. Lets see if we can specify the exact colors we want the parties to be. I found some information here that helped me get the colors set manually: * http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually#change-colors-by-groups

Color codes came from:

 ggplot(aes(x=cand_nm, fill=party), data=ec) + 
  stat_count(width = 1)+
  scale_fill_manual(values=c("#232066", "#00A95C", "#E91D0E"))+
  ggtitle('Number of Donations by Candidate')+
  labs(x='Candidate', y='Number of Donations')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Looking much better so far. I see a few things left to change. The label on the legend isn’t capitalized, a minor point and the data labels for the y axis are unnecessarily large, maybe we can reduce them by a factor of 1000, let’s see.

I struggled reformatting the y axis. This is the resource that helped me identify the use of a function to re-scale it: * http://stackoverflow.com/questions/4646020/ggplot2-axis-transformation-by-constant-factor

 ggplot(aes(x=cand_nm, fill=party), data=ec) + 
  stat_count(width = 1)+
  scale_fill_manual(values=c("#232066", "#00A95C", "#E91D0E"),name="Party")+
  scale_y_continuous(labels=function(x)x/1000)+
  ggtitle('Number of Donations by Candidate')+
  labs(x='Candidate', y='Number of Donations/1,000')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

I like this. I’m going to call it complete.

Final - Number of Contributions through the campaign

 ggplot(aes(x=cand_nm, fill=party), data=ec) + 
  stat_count(width = 1)+
  scale_fill_manual(values=c("#232066", "#00A95C", "#E91D0E"),name="Party")+
  scale_y_continuous(labels=function(x)x/1000)+
  ggtitle('Number of Donations by Candidate')+
  labs(x='Candidate', y='Number of Donations/1,000')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Total Donations over time

Here is the chart from earlier in the analysis:

ggplot(aes(x=contb_receipt_dt, y=n), 
       data= subset(ec.day_group, contb_receipt_dt >= as.Date("2015-03-01"))) +
  geom_line(aes(color=cand_nm))

Alright lets add some labels to these axis and legend.

ggplot(aes(x=contb_receipt_dt, y=n), 
       data= subset(ec.day_group, contb_receipt_dt >= as.Date("2015-03-01"))) +
  ggtitle('Number of donations over time')+
  labs(x='Date', y='Number of Donations')+
  geom_line(aes(color=cand_nm)) +
  scale_color_discrete(name='Candidate')

There are way too many people in this race to really show detail. So I’ll try and filter this down a bit more by making the date rage smaller and removing candidates that had dropped out of the race before the start of our new range.

# Create a list of candidates who were in the race on Jan 1, 2016
candidates = c("O'Malley, Martin Joseph",
               "Clinton, Hillary Rodham",
               "Sanders, Bernard",
               "Trump, Donald J.", 
               "Kasic, John R.", 
               "Cruz, Rafael Edward 'Ted'",
               "Rubio, Marco",
               "Carson, Benjamin S.",
               "Bush, Jeb",
               "Gilmore, James S IIII",
               "Christie, Christopher J.",
               "Fiorina, Carly",
               "Santorum, Richard J.",
               "Paul, Rand",
               "Huckabee, Mike") 

ggplot(aes(x=contb_receipt_dt, y=n), 
       data= subset(ec.day_group, 
                    contb_receipt_dt >= as.Date("2016-01-01") & cand_nm %in% candidates)) +
  ggtitle('Number of donations over time')+
  labs(x='Date', y='Number of Donations')+
  geom_line(aes(color=cand_nm)) +
  scale_color_discrete(name='Candidate')

Lets reduce the Y labels a bit, they are a little large:

ggplot(aes(x=contb_receipt_dt, y=n/1000), 
       data= subset(ec.day_group, 
                    contb_receipt_dt >= as.Date("2016-01-01") & cand_nm %in% candidates)) +
  ggtitle('Number of donations over time')+
  labs(x='Date', y='Number of Donations/1000')+
  geom_line(aes(color=cand_nm)) +
  scale_color_discrete(name='Candidate')

I like this, I’m going to call this one done!

Final - Total Donations over time

ggplot(aes(x=contb_receipt_dt, y=n/1000), 
       data= subset(ec.day_group, 
                    contb_receipt_dt >= as.Date("2016-01-01") & cand_nm %in% candidates)) +
  ggtitle('Number of donations over time')+
  labs(x='Date', y='Number of Donations/1,000')+
  geom_line(aes(color=cand_nm)) +
  scale_color_discrete(name='Candidate')

Contributions Around the latest Democratic Debate

Here is the chart from earlier in the analysis:

ggplot(aes(x=contb_receipt_dt, y=n), 
       data= subset(ec.day_group, 
                    contb_receipt_dt >= as.Date("2016-04-01") &
                      party == 'Democrat'
                    )) +
  geom_line(aes(color=cand_nm)) +
  geom_vline(xintercept = as.numeric(as.Date("2016-04-14")), colour = "blue")

As with the others, lets start with some labels. I decided that the line for O’Malley was distracting from the message of the plot so I removed his data, which will have to be addressed in a title change. I found information on adding the labels to the chart instead of relying on he legend here: http://www.r-bloggers.com/directlabels-adding-direct-labels-to-ggplot2-and-lattice-plots/ Also found some good info on the use of Annotate to add the label for the debate line: http://docs.ggplot2.org/0.9.3/annotate.html

ggplot(aes(x=contb_receipt_dt, y=n/1000), 
       data= subset(ec.day_group, 
                    contb_receipt_dt >= as.Date("2016-04-01") &
                      party == 'Democrat' & 
                      cand_nm != "O'Malley, Martin Joseph")) +
  geom_line(aes(color=cand_nm)) +
  labs(x='Date', y='Number of Donations/1000', title="Number Of Donations To The Two Leading Candidates\n Around The Latest Democratic Debate")+
  geom_text(data=ec.day_group[ec.day_group$contb_receipt_dt == as.Date("2016-04-30") & ec.day_group$cand_nm =="Sanders, Bernard",], aes(label=cand_nm, color=cand_nm), hjust=1, angle = 12, vjust=2)+
  geom_text(data=ec.day_group[ec.day_group$contb_receipt_dt == as.Date("2016-04-30") & ec.day_group$cand_nm =="Clinton, Hillary Rodham",], aes(label=cand_nm, color=cand_nm), hjust=1, angle = 5, vjust=-.7)+
  geom_vline(xintercept = as.numeric(as.Date("2016-04-14")), colour = "blue")+
   annotate("text", x = as.Date("2016-04-14"), y = 950, angle = 90, vjust= -.7, size=3, color="blue", label = "Debate April 14, 2016")

Let’s remove the legend: http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/ since it is redundant at this point.

ggplot(aes(x=contb_receipt_dt, y=n/1000), 
       data= subset(ec.day_group, 
                    contb_receipt_dt >= as.Date("2016-04-01") &
                      party == 'Democrat' & 
                      cand_nm != "O'Malley, Martin Joseph")) +
  geom_line(aes(color=cand_nm)) +
  labs(x='Date', y='Number of Donations/1000', title="Number Of Donations To The Two Leading Candidates\n Around The Latest Democratic Debate")+
  geom_text(data=ec.day_group[ec.day_group$contb_receipt_dt == as.Date("2016-04-30") & ec.day_group$cand_nm =="Sanders, Bernard",], aes(label=cand_nm, color=cand_nm), hjust=1, angle = 10, vjust=2)+
  geom_text(data=ec.day_group[ec.day_group$contb_receipt_dt == as.Date("2016-04-30") & ec.day_group$cand_nm =="Clinton, Hillary Rodham",], aes(label=cand_nm, color=cand_nm), hjust=1, angle = 5, vjust=-.7)+
  geom_vline(xintercept = as.numeric(as.Date("2016-04-14")), colour = "blue")+
   annotate("text", x = as.Date("2016-04-14"), y = 950, angle = 90, vjust= -.7, size=3, color="blue", label = "Debate April 14, 2016") +
  theme(legend.position="none")

This looks good to me, I’m going to call this final.

Final - Contributions Around the latest Democratic Debate

ggplot(aes(x=contb_receipt_dt, y=n/1000), 
       data= subset(ec.day_group, 
                    contb_receipt_dt >= as.Date("2016-04-01") &
                      party == 'Democrat' & 
                      cand_nm != "O'Malley, Martin Joseph")) +
  geom_line(aes(color=cand_nm)) +
  labs(x='Date', y='Number of Donations/1000', title="Number Of Donations To The Two Leading Candidates\n Around The Latest Democratic Debate")+
  geom_text(data=ec.day_group[ec.day_group$contb_receipt_dt == as.Date("2016-04-30") & ec.day_group$cand_nm =="Sanders, Bernard",], aes(label=cand_nm, color=cand_nm), hjust=1, angle = 10, vjust=2)+
  geom_text(data=ec.day_group[ec.day_group$contb_receipt_dt == as.Date("2016-04-30") & ec.day_group$cand_nm =="Clinton, Hillary Rodham",], aes(label=cand_nm, color=cand_nm), hjust=1, angle = 5, vjust=-.7)+
  geom_vline(xintercept = as.numeric(as.Date("2016-04-14")), colour = "blue")+
   annotate("text", x = as.Date("2016-04-14"), y = 950, angle = 90, vjust= -.7, size=3, color="blue", label = "Debate April 14, 2016") +
  theme(legend.position="none")

Reflection

This data set was interesting and frustrating at the same time. I started working on this project in November 2015 and as you can imagine the data has changed significantly from them to now. As the data set grew I quickly discovered that plotting the data unfiltered took my computer a long time, which made progress slow. Since I am brand new to R and using it to plot with ggplot2 there was quite a learning curve for me, and the extra cycles per attempt made iterating on ideas a little tough. Just the reality of doing analysis on larger sets of data.

I also feel like some of my charts are somewhat hacked together. By that I mean they work for my static data set, but as I know from working in my professional career charts are rarely used once and tossed. Likely plots like these would need to be updated on a regular basis. Things like the debate dates, which candidates had dropped out of the race by any particular date etc are hard coded in my solutions, and are apt to break if run with an updated set of data. For the project the solutions work, but if I were going to be using this information regularly I would put more effort into making the plots a bit less of one trick ponies and more of a tool to output a kind of chart.

I really enjoyed getting familiar with R. It amazes me that I’m using my 9 year old laptop to do analysis on data sets with 3.2 Million rows without too much trouble. ggplot is a really fun tool to use and I’m starting to get a good feel for the way to compose visualizations to get what you want. I still have a long way to go but this lesson has been very helpful in bringing me closer to being able to use these tools in my daily work.